This new communication system offers reduced configuration and deployment complexity, improved security and performance, and a simpler, easier-to-maintain design. The biggest new features are unidirectional socket listening (the server no longer needs to be able to open a socket to the agent), performance and scalability via NIO/AIO, and simple-to-use SSL-based authorization and encryption.
The agent communications and configuration systems have some weaknesses that can be remedied with an overhaul of the config and connection management system. This proposal entails replacing the current communications model based on JBoss Remoting 2.x with a new system utilizing the HornetQ messaging system.
Agent and server both need open listening ports making typical setups with firewalls difficult and insecure
The agent can finish the configuration wizard but still be misconfigured
Most commonly the agent address is unreachable from the server
The server public address (different from the entered server address) is unreachable from the agent
Configuring for SSL requires the use of the advanced config system and is overly complicated
You need to know that you need the advanced settings to do SSL which is not obvious
Servers don't verify agent reachability during the wizard, leading to common config failures that are never reported
The server should require an administrator password before an agent can register with it
No management of Agent SSL certs for client auth from the server
Need better all around error checking and messaging
People forget to use the agent-wrapper because you have to kill the agent after configuring (non-intuitive)
Be easy to use without reading install docs
Be backward compatible with existing agent installs
Agent connects to server only (no server to agent comms for firewalls)
Rock solid SSL (including managed agent certs)
Agent config will auto-restart using service wrapper on completion
Proposed configuration steps will be:
Enter server name
Choose non-secure / ssl
Enter server user / password (* only if the server responds that auth is required to register)
Confirm or choose advanced settings
jboss-remoting 2.x is unable to support one open port from agent to server with no open listeners on the agent. It does have bisocket, which has a comms port plus a control port open on the server, but this doesn't appear to be HTTP tunneled at all and doesn't seem like it would scale with blocking IO to hundreds of servers. Each agent would essentially hold open two connections to the server, each tied to a blocking thread.
jboss-remoting 3.x is not yet complete and hasn't had a GA release, but it does appear to have pull callbacks via a single connection from agent to server. It does appear to support xnio when using its own socket but doesn't seem to support this when HTTP tunneled. It would probably give us what we need to scale to hundreds of agents, but it's very difficult to be sure given the state of the project and complete lack of documentation.
Remoting has always been lower-level than we hoped to use, but our needs include control over geographic routing and connection management for server-side caching that have to date been infeasible in messaging technologies. HornetQ may just be far enough along at this point to give us what we need. It has pluggable routing for geography and failover, pluggable connection management and pluggable security, and would give us agent->server comms on a single port over tunneled HTTP with SSL. It uses Netty (NIO, AIO) for non-blocking connection scalability on the server and has been extensively performance tested and optimized. It doesn't have client-side reliability of the type that we need, but I think we'll be able to work around this by blocking for acks on sends of reliable messages; a send that times out would fall back into our standard client message persistence system.
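The ack-timeout workaround described above could look roughly like the following. This is a minimal sketch using only the JDK, not HornetQ's API: the class name ReliableSender, the transport function and the in-memory spool are all hypothetical stand-ins for the real transport and for the agent's existing message persistence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical sketch: block for a server ack and spool the message locally if none arrives. */
final class ReliableSender {
    // Stand-in for the transport: returns a future that completes when the server acks.
    private final Function<String, CompletableFuture<Void>> transport;
    // Stand-in for the agent's client-side message persistence (in memory here).
    private final List<String> spool = new ArrayList<>();
    private final long ackTimeoutMillis;

    ReliableSender(Function<String, CompletableFuture<Void>> transport, long ackTimeoutMillis) {
        this.transport = transport;
        this.ackTimeoutMillis = ackTimeoutMillis;
    }

    /** Returns true if the server acked in time; otherwise spools the message and returns false. */
    boolean send(String message) {
        try {
            transport.apply(message).get(ackTimeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            spool.add(message); // no ack: keep the message for later redelivery
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            spool.add(message);
            return false;
        }
    }

    /** Copy of the messages waiting for redelivery. */
    List<String> spooled() {
        return new ArrayList<>(spool);
    }
}
```

Non-reliable messages would skip the blocking wait entirely; only messages flagged as reliable pay the round-trip cost.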
JBoss Remoting is the primary alternative as mentioned above, but there are other technologies that could be used. Some other messaging solutions offer similar semantics and features, but I can't find any that offer similar ease of integration, performance and extensibility. Having looked at projects such as RabbitMQ, ActiveMQ,
HornetQ will be embedded in the server and will be programmatically started up to create a HornetQ cluster from the server-managed cluster IP addresses. Agents will have reliable and non-reliable queues that will be addressed by the agent_id assigned at agent registration. This should let any server route through other servers to send calls down to the agents connected to them at any given time. The client failover list will be programmatically configured on the agent, as it is today, with our server's geographic config logic. Failover listeners will be used to jump in and register with the new server (for cache warmup and connection logic) before new messages are sent to it.
Default comms will be over SSL to port 7444, and the same port will also be used for the CLI's remoting comms. This has the downside of being a new port to open from agents, but in most cases outbound firewalls on agents are unlikely and the servers are but one place to configure. We could use the servlet tunneler, but we'd lose all NIO/AIO, causing severe agent count limitations.
Multicast for server clusters and for agent-server discovery will not be used at all, as by far the most common scenarios would be unable to use it and we already have a server clustering system that supports managed real-world scenarios.
Disabling custom throttling for the moment and trying the producer throttling built into HornetQ
Getting rid of the model where the client sends a "connect" command to the server... instead the server will detect that it's handling a request from an agent it's not currently connected to.
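One way to picture the implicit connect is a server-side set of known agents that is consulted on every request. The sketch below is hypothetical (the class and method names are not from the RHQ code base) and uses only the JDK.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: detect the first request from an agent instead of requiring a "connect" command. */
final class AgentSessionTracker {
    // Agent ids this server currently considers connected.
    private final Set<Integer> connected = ConcurrentHashMap.newKeySet();

    /**
     * Called for every incoming request. Returns true when the agent was not
     * previously known, which is the cue to run connect-time logic such as cache warmup.
     */
    boolean onRequest(int agentId) {
        return connected.add(agentId);
    }

    /** Called when the connection drops, so the next request is treated as a new connect. */
    void onDisconnect(int agentId) {
        connected.remove(agentId);
    }
}
```

Because Set.add is atomic on a concurrent set, only one of several simultaneous first requests from the same agent would trigger the connect logic.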
The command system in comms is being dropped in favor of a direct-to-POJO invoker model. This should reduce the complexity of the module now that POJOs are our primary call mechanism and only a few commands will have to be ported over to POJO calls.
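A direct-to-POJO invoker can be as small as a name-to-object registry plus reflection. The following is a minimal sketch under stated assumptions: PojoInvoker and PingService are hypothetical names, and methods are matched by name and argument count only, which a real implementation would need to tighten (overloads, primitives, security checks).

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical example service; any plain object with public methods would do. */
final class PingService {
    public String ping(String payload) {
        return "pong:" + payload;
    }
}

/** Hypothetical sketch: dispatch a (service, method, args) message straight onto a registered POJO. */
final class PojoInvoker {
    private final Map<String, Object> services = new ConcurrentHashMap<>();

    void register(String name, Object pojo) {
        services.put(name, pojo);
    }

    Object invoke(String service, String methodName, Object... args) throws Exception {
        Object target = services.get(service);
        if (target == null) {
            throw new IllegalArgumentException("Unknown service: " + service);
        }
        // Simplification: the first public method with a matching name and arity wins.
        for (Method m : target.getClass().getMethods()) {
            if (m.getName().equals(methodName) && m.getParameterCount() == args.length) {
                try {
                    return m.invoke(target, args);
                } catch (InvocationTargetException e) {
                    // Surface the POJO's own exception rather than the reflection wrapper.
                    Throwable cause = e.getCause();
                    if (cause instanceof Exception) {
                        throw (Exception) cause;
                    }
                    throw e;
                }
            }
        }
        throw new NoSuchMethodException(service + "." + methodName);
    }
}
```

With this shape, porting an old command amounts to exposing its logic as a method on a registered object.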
The server should see improved performance and agent scalability through the use of non-blocking IO for incoming agent connections and the lack of a need to make a server-to-agent connection. So far, testing is not indicating much, if any, slowdown of server-to-agent messages, and I'm able to make round-trip requests with replies in under a couple of milliseconds. This means we need not change the assumed semantics for server-side code calling into an agent. The same ping infrastructure works for verifying that an agent is fully reachable before sending messages (e.g. to coordinate cluster operations).
Another major performance advantage should come from the high-performance persistent journal that HornetQ integrates. This means we can keep the reliability our messaging needs, push data quickly to the server (rather than throttling exclusively on the agent) and still see high performance, all without the need for further database IO. By moving our other JMS uses to HornetQ we should also see a marked performance improvement over our current messaging system and reduced impact on the database.
With HornetQ, we have the option of using HA backup nodes for the server endpoints to provide reliability of the queues once messages are delivered. However, this costs simplicity: server installations would no longer be homogeneous, because these are warm backups and not live nodes. It also protects against the loss of data during the short period in which it is in transfer to the server or in responses back to the client. The latter case is relatively safe, in that the client will know the call failed when a response is not received within the timeout.
Since HornetQ is resident in the RHQ JVM, if it's down the RHQ server is likely down and vice versa; the agents will therefore immediately switch to another server endpoint on a messaging connection error. Since we're not using HornetQ's failover mechanism, it won't manage switching the session and consumers over, but on the agent side we have our offline buffer and RPC response receiving systems to deal with most scenarios for messages sent to the server but never processed.
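Agent-managed failover over the configured server list can be sketched as below. This is a hypothetical illustration using only the JDK: FailoverList is not an RHQ or HornetQ class, and the tryConnect predicate stands in for "open a connection and register with that server".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: the agent walks its ordered server list when the current endpoint fails. */
final class FailoverList {
    private final List<String> endpoints;
    private int current = 0;

    FailoverList(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one server endpoint is required");
        }
        this.endpoints = new ArrayList<>(endpoints);
    }

    String current() {
        return endpoints.get(current);
    }

    /**
     * Tries every endpoint once, starting after the failed one and retrying the
     * failed one last. Returns the endpoint now in use, or null if none accepted.
     */
    String failover(Predicate<String> tryConnect) {
        for (int i = 1; i <= endpoints.size(); i++) {
            int candidate = (current + i) % endpoints.size();
            if (tryConnect.test(endpoints.get(candidate))) {
                current = candidate;
                return endpoints.get(candidate);
            }
        }
        return null;
    }
}
```

When failover returns null, the agent would keep filling its offline buffer and retry later.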
The final piece is that the agent comms to the server are much more fragile than other links in this architecture, and this new solution has the added advantage of being able to get more data to the server without worrying about the server necessarily processing it immediately. This is actually an advantage for guaranteed messages. For reply-type messages we have the additional advantage of queue priorities, which let us ask the server to process these time-sensitive requests earlier than other asynchronous requests.
The summary is that each server will be a non-HA HornetQ server endpoint integrated into a single HornetQ cluster spanning the whole RHQ cluster. Agents will manage failover themselves to each server endpoint. Each RHQ server will use an InVM connection to its embedded HornetQ server. HornetQ will also ensure that any server can communicate with any agent by routing messages to the server that the agent is listening on at any given point. (See HornetQ Docs Ch. 38)
The command concurrency system is meant to be removed from this implementation for several reasons. It was originally designed to make sure that critical commands could get through during high-load periods, while cooperating with sending agents so that non-time-sensitive commands could get through at a later period. It is effectively a delivery priority system. It is fairly complex because it needs cooperation from both agents and servers: the previous communications model is real-time RPC, and the servers need the agents to back off sending in order to throttle the work.
The new system can get away without this complexity because of the server-side delivery queue along with priorities on the queues. This means agents can send data as fast as necessary (within reason) and the servers can process important requests (like RPC calls or replies) before they get around to the "eventual" delivery stuff like metrics. The primary scenario that caused the need for the original system was server outages (planned or unplanned) that would cause the agents to build up a queue of messages to send to the server. The servers would be inundated when they came back up, causing further problems. The new system allows connection and priority messages to flow fine while allowing for eventual delivery of other messages.
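The effect of queue priorities can be illustrated with a plain JDK priority queue. This sketch is not HornetQ code: PriorityInbox and its constants are hypothetical, and the 0-9 scale with 9 as most urgent is an assumption borrowed from JMS message priorities.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: higher-priority messages are consumed first, FIFO within a priority. */
final class PriorityInbox {
    static final int PRIORITY_RPC = 9;     // assumed value for time-sensitive calls and replies
    static final int PRIORITY_METRICS = 1; // assumed value for "eventual" delivery data

    private static final class Message {
        final int priority;
        final long seq;
        final String body;

        Message(int priority, long seq, String body) {
            this.priority = priority;
            this.seq = seq;
            this.body = body;
        }
    }

    // Arrival counter used as a tie-breaker so equal priorities stay in arrival order.
    private final AtomicLong seq = new AtomicLong();
    private final PriorityBlockingQueue<Message> queue = new PriorityBlockingQueue<>(
            16,
            Comparator.comparingInt((Message m) -> -m.priority).thenComparingLong(m -> m.seq));

    void enqueue(int priority, String body) {
        queue.add(new Message(priority, seq.getAndIncrement(), body));
    }

    /** Returns the next message body, or null when the inbox is empty. */
    String next() {
        Message m = queue.poll();
        return m == null ? null : m.body;
    }
}
```

After an outage, a backlog of metrics enqueued at the low priority would not delay an RPC call enqueued behind it, which is the behavior the old concurrency system had to negotiate between agents and servers.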
There will be a few tricks to an upgrade from servers and agents of the old system to the new ones. The biggest issue is the port on which agents communicate with the server. The new HornetQ system listens on a new port rather than tunneling over the web-request port as the old agents did. This means that the agent configurations will need to be altered and that the open ports from agent to server will need to be updated on any firewalls in between. The configurations can be auto-upgraded to the "default" server listening port reasonably easily, mitigating that issue, but there will be a need to ensure the new port on the server is widely accessible. As this is a one-way connection, though, it should be reasonably simple in most environments. (This point is worth discussing with current users before final design)
The rest of the agent upgrade system may be possible to maintain so that agents can upgrade themselves to the new system. The agent download and all the rest is meant to work as before. The only thing that would be needed is a stub in the new server that always throws the AgentNotSupportedException to old JBossRemoting based agents so that they attempt an upgrade. I've not implemented this stub yet, but I don't see it being a big problem.
As for upgrading the servers, one enhancement worth looking at would be to allow the server to finish processing its local RPC queue on shutdown, as the server after upgrade may not be serialization-compatible with the queued requests. This is not much different from the state of the agents during the current upgrade cycle, which is handled the same way, with the potential loss of a small amount of data during the upgrade. This will be mitigated as well by maintaining serialization compatibility wherever feasible during upgrades.
Having a proper clustered messaging system in the RHQ server affords the opportunity to simplify some of the other infrastructure in use today. Alert processing can switch to this from the out-of-date JBM, which will let us simplify installation through the removal of the JMS DB tables and the associated RDBMS load when processing alerts. We can also simplify server coordination of things like alert condition cache rebuilds and server config changes, which currently happens through scheduled DB checks. We also have the option of allowing a subset of messages to load balance themselves to other servers (in the case of non-connect-related items). The CLI remote client will also need to be reworked to utilize this RPC system.
See also Design - Agent to Server Unidirectional Communications